Metrics with Prometheus

What You Will Learn

The Prometheus data model and why it differs from traditional monitoring systems
All four Prometheus metric types with production use cases and common mistakes
How to expose metrics from a FastAPI service with auto-instrumentation
How to write custom application metrics for a document processing service
Ten production PromQL queries every SRE should know
How to write Prometheus alerting rules, route them with Alertmanager, and import Grafana dashboards from JSON

Prerequisites

Requirement	Details
Python 3.11+	Type hints used throughout
FastAPI basics	All examples instrument FastAPI
`prometheus-client`, `prometheus-fastapi-instrumentator`	`pip install prometheus-client prometheus-fastapi-instrumentator`
Docker + docker-compose	Prometheus, Alertmanager, Grafana run in containers
Lesson 01 complete	Logging context assumed

The Incident: 3 AM, p99 > 2 Seconds

PagerDuty fires at 03:17. The alert: p99 latency > 2s on document-api. Your on-call rotation just woke you up.

You open the logs. They look like this:

INFO  POST /api/documents 1891ms
INFO  POST /api/documents 2103ms
INFO  POST /api/documents 847ms
INFO  POST /api/documents 3201ms

Slow, but no errors. You have no metrics. Without metrics, your investigation looks like this:

03:17 - Wake up, look at logs
03:22 - Try to reproduce locally (cannot, it's a production load pattern)
03:31 - Start adding time.time() instrumentation to guess where the slowness is
03:48 - Realise you need to deploy the change and wait for production traffic
04:02 - Still investigating; users are complaining

Now imagine you have a Prometheus histogram tracking latency by route, with the document size recorded as a bounded document_size_bucket label (for example small, medium, large). A single PromQL query at 03:17 shows:

histogram_quantile(0.99,
  sum by (le, http_route, document_size_bucket) (
    rate(http_request_duration_seconds_bucket[5m])
  )
)

The result: p99 is high only on POST /api/documents when document_size_bucket is "large". Large documents are slow. The entire root cause analysis takes 90 seconds.

This lesson is about having those metrics in place before you need them.

1. The Prometheus Data Model

Prometheus stores time series - sequences of (timestamp, float64) pairs, identified by a metric name and a set of labels.

http_request_duration_seconds_bucket{
    method="POST",
    route="/api/documents",
    status_code="200",
    le="0.5"
} = 1847

Metric name: http_request_duration_seconds_bucket
Labels: key-value pairs that give the metric its dimensions
Value: a float64 (here, the count of requests with duration ≤ 0.5s)
Timestamp: added by Prometheus at scrape time
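
To make the anatomy above concrete, here is a minimal sketch of how one sample could be rendered as a line of the Prometheus text exposition format. The helper names format_sample and escape_label_value are hypothetical (they are not part of prometheus_client), labels are sorted only to make the output deterministic, and the timestamp is left out because Prometheus assigns it at scrape time.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def format_sample(name: str, labels: dict[str, str], value: float) -> str:
    """Render one sample as name{label="value",...} value (hypothetical helper)."""
    if not labels:
        return f"{name} {float(value)}"
    rendered = ",".join(
        f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
    )
    return f"{name}{{{rendered}}} {float(value)}"


line = format_sample(
    "http_request_duration_seconds_bucket",
    {"method": "POST", "route": "/api/documents", "status_code": "200", "le": "0.5"},
    1847,
)
print(line)
```

Every distinct label set rendered this way is a separate time series, which is why the next section on cardinality matters.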

Cardinality: The Critical Constraint

Every unique combination of label values creates a new time series. This is cardinality. High cardinality destroys Prometheus performance.

# CORRECT: Low cardinality labels
request_counter.labels(
    method="POST",
    route="/api/documents",  # Fixed set of routes
    status_code="200",       # 5xx, 4xx, 2xx or exact codes
).inc()

# WRONG: High cardinality - explodes Prometheus
request_counter.labels(
    user_id="usr_4492",      # Millions of unique user IDs!
    document_id="doc_8f3a",  # Billions of documents!
    request_id="req_7e9d",   # Unique per request!
).inc()

Rule: Labels should have bounded cardinality. Anything with more than ~100 unique values is suspicious. Anything per-user or per-request belongs in logs, not metrics.
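
A quick way to apply the rule is to estimate the worst-case series count before adding a label. The sketch below multiplies the number of distinct values per label; the function name and the example cardinalities are illustrative assumptions, not measurements from a real service.

```python
from math import prod


def max_series(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


# Bounded labels: a few methods, a fixed route list, a handful of status codes.
bounded = max_series({"method": 5, "route": 20, "status_code": 10})

# One unbounded label is enough to explode the series count.
unbounded = max_series({"method": 5, "route": 20, "user_id": 1_000_000})

print(bounded)    # 1000
print(unbounded)  # 100000000
```

For a histogram, multiply the result again by the number of buckets, since each bucket is its own series.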

The Prometheus Scrape Model

Prometheus pulls metrics by scraping HTTP endpoints at regular intervals (typically 15s). Your Python service exposes a /metrics endpoint; Prometheus polls it.

┌─────────────────┐    scrape every 15s    ┌─────────────────┐
│  Python Service │ ─────────────────────► │   Prometheus    │
│  :8001/metrics  │                        │  (stores TSDB)  │
└─────────────────┘                        └────────┬────────┘
                                                    │ query
                                                    ▼
                                           ┌─────────────────┐
                                           │     Grafana     │
                                           │   Dashboards    │
                                           └─────────────────┘

2. Counter

A counter is a monotonically increasing value. It only goes up (and resets to zero when the process restarts). Use it for events that accumulate: requests, errors, bytes processed, messages consumed.

from prometheus_client import Counter

# Define at module level - Prometheus registers metrics globally
http_requests_total = Counter(
    "http_requests_total",               # Metric name
    "Total HTTP requests received",       # Help text (shown in /metrics)
    ["method", "route", "status_code"],  # Label names
)

errors_total = Counter(
    "app_errors_total",
    "Total application errors by type",
    ["error_type", "component"],
)

documents_processed_total = Counter(
    "documents_processed_total",
    "Total documents processed",
    ["content_type", "status"],
)

Using Counters

# Increment by 1 (most common)
http_requests_total.labels(
    method="POST",
    route="/api/documents",
    status_code="200",
).inc()

# Increment by N
documents_processed_total.labels(
    content_type="application/pdf",
    status="success",
).inc(batch_size)

# Track exceptions
try:
    result = process_document(doc)
except ValidationError as e:
    errors_total.labels(
        error_type="ValidationError",
        component="document_processor",
    ).inc()
    raise

PromQL for Counters

# Requests per second over the last 5 minutes
rate(http_requests_total[5m])

# Error rate (errors per second)
rate(app_errors_total[5m])

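# Error percentage (aggregate both sides first; without sum(), the division
# matches series label-for-label, so 5xx series are only divided by themselves)
(
  sum(rate(http_requests_total{status_code=~"5.."}[5m]))
  /
  sum(rate(http_requests_total[5m]))
) * 100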

# Total requests in the last hour (handles counter resets)
increase(http_requests_total[1h])

# Top 5 error types
topk(5, sum by (error_type) (rate(app_errors_total[5m])))
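
The reset handling mentioned for increase() can be illustrated with a few lines of Python. This is a simplified sketch under stated assumptions: it only shows how a drop in a counter is treated as a restart from zero, and it omits the extrapolation to the window boundaries that Prometheus also performs. The function name is hypothetical.

```python
def reset_aware_increase(samples: list[float]) -> float:
    """Sum the growth between consecutive counter samples.

    A sample lower than its predecessor is treated as a process restart:
    the counter began again at zero, so the new value is all new growth.
    """
    total = 0.0
    for previous, current in zip(samples, samples[1:]):
        total += current - previous if current >= previous else current
    return total


# The counter climbs to 160, the process restarts, then it climbs again.
samples = [100.0, 130.0, 160.0, 5.0, 25.0]
print(reset_aware_increase(samples))  # 85.0
```

A naive last-minus-first calculation on the same samples would report -75, which is why raw counter values should always be wrapped in rate() or increase().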

Counter Pitfalls

# WRONG: Using a counter for something that can go down
active_connections = Counter("active_connections", "...")  # Wrong!
# This cannot go down - use a Gauge

# WRONG: Calling .inc() with a float > 1 when you mean events
requests.inc(response_time)  # Wrong - this is not counting requests
# Use a Histogram for response time

# CORRECT: Counting bytes (a counter, because bytes never decrease)
bytes_sent = Counter("bytes_sent_total", "Total bytes sent")
bytes_sent.inc(len(response_body))

3. Gauge

A gauge is a value that can go up and down. Use it for current state: active connections, queue depth, memory usage, number of items in a cache, temperature.

from prometheus_client import Gauge

# Database connection pool
db_connections_active = Gauge(
    "db_connections_active",
    "Number of currently active database connections",
    ["pool_name"],
)

db_connections_idle = Gauge(
    "db_connections_idle",
    "Number of idle database connections in the pool",
    ["pool_name"],
)

# Message queue
queue_depth = Gauge(
    "document_queue_depth",
    "Number of documents waiting to be processed",
    ["queue_name", "priority"],
)

# Memory (custom, in addition to the process_* metrics the client library exports)
model_memory_bytes = Gauge(
    "ml_model_memory_bytes",
    "Memory used by loaded ML models",
    ["model_name", "version"],
)

Using Gauges

# Set to a specific value
db_connections_active.labels(pool_name="primary").set(pool.checked_out)
db_connections_idle.labels(pool_name="primary").set(pool.idle_count)

# Increment and decrement
queue_depth.labels(queue_name="document_processing", priority="high").inc()
# ... after processing:
queue_depth.labels(queue_name="document_processing", priority="high").dec()

# Use as a context manager - automatically inc on enter, dec on exit
with queue_depth.labels(queue_name="document_processing", priority="high").track_inprogress():
    process_document(doc)

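# Track the duration of the most recent call as a gauge with .time()
# (track_inprogress() counts concurrent calls; it does not measure time)
model_load_seconds = Gauge(
    "ml_model_load_seconds",
    "Duration of the most recent model load in seconds",
)

@model_load_seconds.time()
def load_model():
    ...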

# Set to current Unix timestamp - useful for "last successful run" gauges
last_backup_timestamp = Gauge(
    "last_backup_timestamp_seconds",
    "Unix timestamp of the last successful database backup",
)
last_backup_timestamp.set_to_current_time()

PromQL for Gauges

# Current queue depth
document_queue_depth{queue_name="document_processing"}

# Connection pool utilisation percentage
(
  db_connections_active{pool_name="primary"}
  /
  (db_connections_active{pool_name="primary"} + db_connections_idle{pool_name="primary"})
) * 100

# Time since last backup (seconds)
time() - last_backup_timestamp_seconds

# Queue depth above threshold; to require 5 minutes of sustained depth,
# use this as an alerting rule expression together with `for: 5m`
document_queue_depth > 100

4. Histogram

A histogram is the most powerful and most misunderstood Prometheus metric type. It tracks the distribution of observed values (like request durations) across configurable buckets.

How Histograms Work

For each observation (e.g., a request that took 347ms), the client library increments:

All _bucket counters where le (less than or equal) >= the observed value
The _count counter (total observations)
The _sum counter (sum of all observed values)
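
The three updates above can be sketched in plain Python. SketchHistogram is a toy class written for illustration, not the prometheus_client implementation; it stores one count per bucket and accumulates on read, which has the same visible effect as incrementing every bucket whose le is at or above the observed value.

```python
import math
from bisect import bisect_left


class SketchHistogram:
    """Toy cumulative histogram; illustrative only."""

    def __init__(self, upper_bounds: list[float]) -> None:
        # The +Inf bucket always exists and catches every observation.
        self.upper_bounds = sorted(upper_bounds) + [math.inf]
        self._per_bucket = [0] * len(self.upper_bounds)
        self.count = 0
        self.sum = 0.0

    def observe(self, value: float) -> None:
        # Index of the first bucket whose upper bound (le) is >= value.
        self._per_bucket[bisect_left(self.upper_bounds, value)] += 1
        self.count += 1
        self.sum += value

    def buckets(self) -> dict[float, int]:
        """Cumulative counts, in the shape of the exposed _bucket series."""
        running = 0
        cumulative: dict[float, int] = {}
        for bound, hits in zip(self.upper_bounds, self._per_bucket):
            running += hits
            cumulative[bound] = running
        return cumulative


hist = SketchHistogram([0.1, 0.25, 0.5, 1.0])
for seconds in (0.05, 0.347, 0.5, 3.0):
    hist.observe(seconds)

print(hist.buckets())  # {0.1: 1, 0.25: 1, 0.5: 3, 1.0: 3, inf: 4}
print(hist.count)      # 4
```

Note that the 0.5s observation lands in the le="0.5" bucket because the bound is inclusive, and the 3.0s observation is only visible in +Inf.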

http_request_duration_seconds_bucket{le="0.1"}  = 892   # requests <= 100ms
http_request_duration_seconds_bucket{le="0.25"} = 1841  # requests <= 250ms
http_request_duration_seconds_bucket{le="0.5"}  = 2103  # requests <= 500ms
http_request_duration_seconds_bucket{le="1.0"}  = 2144  # requests <= 1000ms
http_request_duration_seconds_bucket{le="2.5"}  = 2147  # requests <= 2500ms
http_request_duration_seconds_bucket{le="+Inf"} = 2147  # all requests
http_request_duration_seconds_count = 2147
http_request_duration_seconds_sum   = 371.9

Defining Histograms with the Right Buckets

Bucket selection is critical. If your SLO is "p99 < 500ms", you need buckets around 500ms to calculate it accurately.

from prometheus_client import Histogram

# Request latency - buckets in seconds, chosen for a web API SLO of p99 < 1s
http_request_duration_seconds = Histogram(
    "http_request_duration_seconds",
    "HTTP request duration in seconds",
    ["method", "route", "status_class"],
    # Dense around the SLO target, sparse elsewhere
    buckets=[
        0.005,   # 5ms
        0.01,    # 10ms
        0.025,   # 25ms
        0.05,    # 50ms
        0.1,     # 100ms
        0.25,    # 250ms
        0.5,     # 500ms
        0.75,    # 750ms
        1.0,     # 1s - SLO boundary
        2.5,     # 2.5s
        5.0,     # 5s
        10.0,    # 10s
        float("inf"),  # +Inf; the client appends it automatically if omitted
    ],
)

# Document processing time - buckets in seconds, much wider range
document_processing_seconds = Histogram(
    "document_processing_seconds",
    "Time to fully process a document",
    ["content_type", "page_count_bucket"],
    buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0, 120.0, float("inf")],
)

# Model inference latency - tight buckets for fast ML inference
model_inference_seconds = Histogram(
    "model_inference_seconds",
    "Time for ML model inference",
    ["model_name", "batch_size_bucket"],
    buckets=[0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, float("inf")],
)

Using Histograms

import time
from contextlib import contextmanager

# Method 1: Manual timing
start = time.perf_counter()
result = process_document(doc)
duration = time.perf_counter() - start
document_processing_seconds.labels(
    content_type="application/pdf",
    page_count_bucket="10-50",
).observe(duration)

# Method 2: Context manager (cleaner)
with document_processing_seconds.labels(
    content_type="application/pdf",
    page_count_bucket="10-50",
).time():
    result = process_document(doc)

# Method 3: Decorator
@http_request_duration_seconds.labels(
    method="POST",
    route="/api/documents",
    status_class="2xx",
).time()
def handle_upload(request):
    ...

PromQL for Histograms

# p50, p95, p99 latency over last 5 minutes
histogram_quantile(0.50, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# p99 broken down by route - find the slow endpoint
histogram_quantile(0.99,
  sum by (le, route) (
    rate(http_request_duration_seconds_bucket[5m])
  )
)

# Average request duration (sum / count)
rate(http_request_duration_seconds_sum[5m])
/
rate(http_request_duration_seconds_count[5m])

# Percentage of requests completing within 500ms
(
  sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
  /
  sum(rate(http_request_duration_seconds_count[5m]))
) * 100
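
To see why bucket boundaries limit accuracy, it helps to reproduce what histogram_quantile does with cumulative bucket counts. The function below is a simplified sketch of the linear interpolation, written for this lesson; it assumes 0 < q <= 1, treats the lowest bucket as starting at 0, and ignores edge cases that the real PromQL function handles. The sample counts are the ones shown earlier in this section.

```python
import math


def estimate_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) pairs."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    ordered = sorted(buckets)
    total = ordered[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in ordered:
        if count >= rank:
            if math.isinf(upper_bound):
                # Beyond the last finite bucket: report that bucket's bound.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


sample = [
    (0.1, 892), (0.25, 1841), (0.5, 2103),
    (1.0, 2144), (2.5, 2147), (math.inf, 2147),
]
p50 = estimate_quantile(0.50, sample)
p99 = estimate_quantile(0.99, sample)
print(round(p50, 3), round(p99, 3))
```

Here the p99 falls somewhere between the 0.5s and 1.0s boundaries, and the estimate is only as precise as that gap. Adding a 0.75s bucket would halve the uncertainty around it.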

Choosing Bucket Boundaries

Scenario	Recommended Buckets (seconds)
Real-time API (SLO: p99 < 100ms)	0.001, 0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 1.0, +Inf
Standard API (SLO: p99 < 1s)	0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 0.75, 1.0, 2.5, 5.0, +Inf
Background jobs (SLO: p99 < 30s)	0.1, 0.5, 1.0, 5.0, 10.0, 15.0, 30.0, 60.0, 120.0, +Inf
Batch processing (SLO: p99 < 5min)	1.0, 5.0, 15.0, 30.0, 60.0, 120.0, 300.0, 600.0, +Inf

5. Summary: When to Use It Instead of a Histogram

Aspect	Histogram	Summary
Quantile calculation	Server-side in Prometheus (PromQL)	Client-side in the application
Aggregation across instances	Yes - `sum by (le)` across replicas	No - quantiles cannot be summed
Memory cost	Fixed (number of buckets × label combinations)	Grows with sliding window size
Accuracy	Approximate (depends on bucket boundaries)	Configurable precision
Best for	Most use cases; SLO alerting	Single-instance services; precise quantiles needed locally

Recommendation: Use Histogram for almost everything. Use Summary only if you have a single-instance service and need exact quantiles locally, and you will never need to aggregate across multiple replicas.
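
The aggregation row in the table is the one that usually decides the matter. The sketch below uses made-up traffic for two replicas to show that averaging per-instance p99 values does not give the p99 of the whole service; the percentile helper is a simple nearest-rank implementation written for this example.

```python
def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile; the rank is computed with integer arithmetic."""
    ordered = sorted(values)
    rank = -(-len(ordered) * pct // 100)  # ceil(n * pct / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical traffic: replica A serves 990 fast requests,
# replica B serves 10 slow ones.
replica_a = [0.1] * 990
replica_b = [1.0] * 10

averaged_p99 = (percentile(replica_a, 99) + percentile(replica_b, 99)) / 2
true_p99 = percentile(replica_a + replica_b, 99)

print(averaged_p99)  # roughly 0.55
print(true_p99)      # 0.1
```

Histogram buckets avoid the problem because bucket counts are plain counters: summing them across replicas yields the buckets of the combined distribution, from which a quantile can then be estimated.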

from prometheus_client import Summary

# Summary example - avoid in multi-instance deployments
request_latency_summary = Summary(
    "request_latency_seconds",
    "Request latency",
    ["route"],
)

# Note: the Python client's Summary exposes only _count and _sum.
# It does not compute quantiles; client-side quantiles over a sliding
# window are a feature of other client libraries (for example Go and Java).

6. FastAPI Auto-Instrumentation

prometheus-fastapi-instrumentator automatically instruments every route with request count, latency histograms, and in-progress request gauges.

# pip install prometheus-fastapi-instrumentator
from fastapi import FastAPI
from prometheus_fastapi_instrumentator import Instrumentator, metrics

app = FastAPI()

# Create instrumentator with production settings
instrumentator = Instrumentator(
    should_group_status_codes=True,   # Report 2xx/4xx/5xx instead of exact codes
    should_ignore_untemplated=True,   # Skip requests that match no registered route
    should_respect_env_var=True,      # Instrument only when ENABLE_METRICS=true
    should_instrument_requests_inprogress=True,  # Required for the in-progress gauge
    # Exclude internal endpoints from metrics (entries are regex patterns)
    excluded_handlers=[
        "/metrics",     # Don't track the metrics endpoint itself
        "/health",      # Don't track health checks
        "/liveness",
        "/readiness",
        "/docs",
        "/openapi.json",
    ],
    inprogress_name="http_requests_inprogress",
    inprogress_labels=True,
)

# Add metrics explicitly (request/response size, latency histogram, request count).
# Adding any metric replaces the instrumentator's built-in default metric.
instrumentator.add(
    metrics.request_size(
        metric_name="http_request_size_bytes",
        metric_doc="HTTP request body size in bytes",
    )
).add(
    metrics.response_size(
        metric_name="http_response_size_bytes",
        metric_doc="HTTP response body size in bytes",
    )
).add(
    metrics.latency(
        metric_name="http_request_duration_seconds",
        metric_doc="HTTP request latency in seconds",
        buckets=[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0],
    )
).add(
    metrics.requests(
        metric_name="http_requests_total",
        metric_doc="Total HTTP requests",
    )
)

# Mount /metrics endpoint (must be before app.include_router calls)
instrumentator.instrument(app).expose(
    app,
    endpoint="/metrics",
    include_in_schema=False,
    tags=["monitoring"],
)

What the /metrics Endpoint Produces

After instrumentation, GET /metrics returns:

# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{handler="/api/documents",method="POST",status="2xx"} 2147.0
http_requests_total{handler="/api/documents",method="POST",status="4xx"} 13.0
http_requests_total{handler="/api/documents",method="POST",status="5xx"} 2.0

# HELP http_request_duration_seconds HTTP request latency in seconds
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{handler="/api/documents",le="0.005"} 0.0
http_request_duration_seconds_bucket{handler="/api/documents",le="0.01"} 12.0
http_request_duration_seconds_bucket{handler="/api/documents",le="0.025"} 89.0
...
http_request_duration_seconds_count{handler="/api/documents"} 2162.0
http_request_duration_seconds_sum{handler="/api/documents"} 1087.4

# HELP http_requests_inprogress HTTP requests currently in progress
# TYPE http_requests_inprogress gauge
http_requests_inprogress{handler="/api/documents",method="POST"} 3.0

# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 4.26844160e+08

# HELP python_gc_objects_collected_total Objects collected during gc
# TYPE python_gc_objects_collected_total counter
python_gc_objects_collected_total{generation="0"} 8731.0
python_gc_objects_collected_total{generation="1"} 492.0
python_gc_objects_collected_total{generation="2"} 12.0

7. Custom Application Metrics

Auto-instrumentation covers HTTP-level metrics. You also need application-level metrics that reflect your business logic.

Complete Metrics Module for a Document Processing Service

# app/metrics.py
"""
Application-level Prometheus metrics.

Import this module once at startup. Metric objects are singletons
registered in the global Prometheus registry.
"""
from prometheus_client import Counter, Gauge, Histogram, Info

# ─── Document Processing ────────────────────────────────────────────────────

documents_received_total = Counter(
    "documents_received_total",
    "Total documents received for processing",
    ["content_type", "source"],
)

documents_processed_total = Counter(
    "documents_processed_total",
    "Total documents that completed processing",
    ["content_type", "status"],  # status: success | validation_error | processing_error
)

document_processing_duration_seconds = Histogram(
    "document_processing_duration_seconds",
    "End-to-end time to process a document",
    ["content_type", "page_count_bucket"],
    buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0, float("inf")],
)

document_size_bytes = Histogram(
    "document_size_bytes",
    "Size of received documents in bytes",
    ["content_type"],
    buckets=[
        1_024,          # 1 KB
        10_240,         # 10 KB
        102_400,        # 100 KB
        1_048_576,      # 1 MB
        10_485_760,     # 10 MB
        52_428_800,     # 50 MB
        float("inf"),
    ],
)

documents_in_queue = Gauge(
    "documents_in_queue",
    "Number of documents currently waiting in the processing queue",
    ["priority"],
)

# ─── ML Model Metrics ───────────────────────────────────────────────────────

model_inference_duration_seconds = Histogram(
    "model_inference_duration_seconds",
    "Time for model inference",
    ["model_name", "model_version"],
    buckets=[0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, float("inf")],
)

model_inference_requests_total = Counter(
    "model_inference_requests_total",
    "Total model inference requests",
    ["model_name", "model_version", "status"],
)

# ─── Cache Metrics ──────────────────────────────────────────────────────────

cache_operations_total = Counter(
    "cache_operations_total",
    "Total cache operations",
    ["cache_name", "operation", "result"],  # result: hit | miss | error
)

cache_items_current = Gauge(
    "cache_items_current",
    "Current number of items in the cache",
    ["cache_name"],
)

# ─── Database Metrics ───────────────────────────────────────────────────────

db_query_duration_seconds = Histogram(
    "db_query_duration_seconds",
    "Database query execution time",
    ["operation", "table"],  # operation: select | insert | update | delete
    buckets=[0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, float("inf")],
)

db_connections_active = Gauge(
    "db_connections_active",
    "Active database connections",
    ["pool_name"],
)

db_connections_idle = Gauge(
    "db_connections_idle",
    "Idle database connections",
    ["pool_name"],
)

db_connection_pool_size = Gauge(
    "db_connection_pool_size",
    "Total database connection pool size",
    ["pool_name"],
)

# ─── External API Metrics ───────────────────────────────────────────────────

external_api_duration_seconds = Histogram(
    "external_api_duration_seconds",
    "Time to complete external API calls",
    ["service", "operation", "status_class"],
    buckets=[0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, float("inf")],
)

external_api_errors_total = Counter(
    "external_api_errors_total",
    "Total external API errors",
    ["service", "operation", "error_type"],
)

# ─── Service Info ───────────────────────────────────────────────────────────

service_info = Info(
    "service",
    "Service metadata",
)

def initialise_service_info(name: str, version: str, environment: str) -> None:
    """Call once at startup to set service metadata in metrics."""
    service_info.info({
        "name": name,
        "version": version,
        "environment": environment,
    })

Using Metrics in Application Code

# app/services/document_processor.py
import time
from app import metrics

class DocumentProcessor:

    async def process(self, content: bytes, content_type: str, priority: str) -> Document:
        # Track document receipt
        metrics.documents_received_total.labels(
            content_type=content_type,
            source="api",
        ).inc()

        metrics.document_size_bytes.labels(
            content_type=content_type,
        ).observe(len(content))

        metrics.documents_in_queue.labels(priority=priority).inc()

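        start = time.perf_counter()
        # Default covers BaseExceptions such as asyncio.CancelledError, which
        # skip both except clauses below but still run the finally block.
        status = "processing_error"
        try: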
            doc = await self._do_process(content, content_type)
            status = "success"
            return doc
        except ValidationError:
            status = "validation_error"
            raise
        except Exception:
            status = "processing_error"
            raise
        finally:
            duration = time.perf_counter() - start
            page_count = getattr(doc, "page_count", 0) if status == "success" else 0

            metrics.documents_processed_total.labels(
                content_type=content_type,
                status=status,
            ).inc()

            metrics.document_processing_duration_seconds.labels(
                content_type=content_type,
                page_count_bucket=_bucket_page_count(page_count),
            ).observe(duration)

            metrics.documents_in_queue.labels(priority=priority).dec()

    async def _do_model_inference(self, text: str, model_name: str) -> dict:
        start = time.perf_counter()
        try:
            result = await self.model.predict(text)
            metrics.model_inference_requests_total.labels(
                model_name=model_name,
                model_version="1.0",
                status="success",
            ).inc()
            return result
        except Exception:
            metrics.model_inference_requests_total.labels(
                model_name=model_name,
                model_version="1.0",
                status="error",
            ).inc()
            raise
        finally:
            duration = time.perf_counter() - start
            metrics.model_inference_duration_seconds.labels(
                model_name=model_name,
                model_version="1.0",
            ).observe(duration)


def _bucket_page_count(pages: int) -> str:
    """Map a page count onto a small, fixed set of label values."""
    if pages <= 0:
        return "unknown"
    if pages <= 5:
        return "1-5"
    if pages <= 20:
        return "6-20"
    if pages <= 100:
        return "21-100"
    return "101+"

8. PromQL Essentials

Ten queries every SRE working with Python services needs to know:

# 1. Request rate per second, last 5 minutes, by route
sum by (handler) (
  rate(http_requests_total[5m])
)

# 2. Error rate as a percentage of total requests
(
  sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
  sum(rate(http_requests_total[5m]))
) * 100

# 3. p99 latency by route (use 0.50 or 0.95 for p50/p95)
histogram_quantile(0.99,
  sum by (le, handler) (
    rate(http_request_duration_seconds_bucket[5m])
  )
)

# 4. Apdex score (target = 500ms, tolerable = 2.5s)
# Thresholds must match configured bucket boundaries, hence 2.5s rather than 2s
# Apdex = (satisfied + 0.5 * tolerating) / total
(
  sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
  +
  0.5 * (
    sum(rate(http_request_duration_seconds_bucket{le="2.5"}[5m]))
    -
    sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
  )
) / sum(rate(http_request_duration_seconds_count[5m]))

# 5. Database connection pool saturation
(
  db_connections_active{pool_name="primary"}
  /
  db_connection_pool_size{pool_name="primary"}
) * 100

# 6. Cache hit rate
(
  sum(rate(cache_operations_total{result="hit"}[5m]))
  /
  sum(rate(cache_operations_total{result=~"hit|miss"}[5m]))
) * 100

# 7. Document processing throughput (docs/sec)
sum(rate(documents_processed_total{status="success"}[5m]))

# 8. Average document processing time
rate(document_processing_duration_seconds_sum[5m])
/
rate(document_processing_duration_seconds_count[5m])

# 9. External API error rate by service
sum by (service) (
  rate(external_api_errors_total[5m])
)

# 10. Python GC collection rate by generation (from the client's auto-collected python_gc_* metrics)
rate(python_gc_collections_total[5m])
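
The Apdex query (number 4) is easier to check with concrete numbers. The sketch below applies the same formula to cumulative bucket counts; the function name is hypothetical, and the inputs reuse the sample histogram from section 4 (2103 requests within 0.5s, 2147 within 2.5s, 2147 in total).

```python
def apdex(satisfied_cum: float, tolerable_cum: float, total: float) -> float:
    """Apdex from cumulative bucket counts at the two thresholds.

    satisfied_cum: requests at or below the target threshold
    tolerable_cum: requests at or below the tolerable threshold
                   (cumulative, so it already includes the satisfied ones)
    """
    tolerating = tolerable_cum - satisfied_cum
    return (satisfied_cum + 0.5 * tolerating) / total


score = apdex(satisfied_cum=2103, tolerable_cum=2147, total=2147)
print(round(score, 4))
```

Because histogram buckets are cumulative, the subtraction is what isolates the requests that were tolerable but not satisfied; skipping it would count the satisfied requests one and a half times.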

9. Alerting Rules

Alerting rules are evaluated by Prometheus at a configurable interval. When a rule's expression returns a non-empty set of time series, the alert becomes pending; once the expression has stayed true for the rule's `for` duration, Prometheus fires the alert to Alertmanager.

docker-compose Setup

# docker-compose.yml additions
prometheus:
  image: prom/prometheus:v2.50.1
  ports:
    - "9090:9090"
  volumes:
    - ./config/prometheus.yml:/etc/prometheus/prometheus.yml
    - ./config/alerts.yml:/etc/prometheus/alerts.yml
    - prometheus_data:/prometheus
  command:
    - "--config.file=/etc/prometheus/prometheus.yml"
    - "--storage.tsdb.retention.time=15d"

alertmanager:
  image: prom/alertmanager:v0.26.0
  ports:
    - "9093:9093"
  volumes:
    - ./config/alertmanager.yml:/etc/alertmanager/alertmanager.yml

# config/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

rule_files:
  - "alerts.yml"

scrape_configs:
  - job_name: "document-api"
    static_configs:
      - targets: ["app:8001"]
    metrics_path: "/metrics"

Five Production Alerting Rules

# config/alerts.yml
groups:
  - name: document_api_alerts
    interval: 30s
    rules:

      # 1. High Error Rate
      - alert: HighErrorRate
        expr: |
          (
            sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum by (job) (rate(http_requests_total[5m]))
          ) * 100 > 1
        for: 2m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "High HTTP error rate on {{ $labels.job }}"
          description: >
            Error rate is {{ $value | printf "%.2f" }}%
            (threshold: 1%) for the last 2 minutes.
          runbook: "https://wiki.example.com/runbooks/high-error-rate"
          dashboard: "https://grafana.example.com/d/abc123"

      # 2. High p99 Latency
      - alert: HighP99Latency
        expr: |
          histogram_quantile(0.99,
            sum by (le, handler) (
              rate(http_request_duration_seconds_bucket[5m])
            )
          ) > 2.0
        for: 3m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "p99 latency > 2s on {{ $labels.handler }}"
          description: >
            p99 latency is {{ $value | printf "%.3f" }}s
            on route {{ $labels.handler }}.
            Alert threshold is 2.0s (SLO target: 1.0s).

      # 3. Low Cache Hit Rate
      - alert: LowCacheHitRate
        expr: |
          (
            sum by (cache_name) (rate(cache_operations_total{result="hit"}[10m]))
            /
            sum by (cache_name) (rate(cache_operations_total{result=~"hit|miss"}[10m]))
          ) * 100 < 60
        for: 5m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "Cache hit rate below 60% on {{ $labels.cache_name }}"
          description: >
            Cache hit rate is {{ $value | printf "%.1f" }}%
            on {{ $labels.cache_name }}.
            This may indicate cache invalidation issues or increased unique traffic.

      # 4. Database Connection Pool Near Exhaustion
      - alert: DatabaseConnectionPoolSaturation
        expr: |
          (
            db_connections_active{pool_name="primary"}
            /
            db_connection_pool_size{pool_name="primary"}
          ) * 100 > 80
        for: 1m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "DB connection pool > 80% utilised"
          description: >
            {{ $value | printf "%.1f" }}% of the primary database connection
            pool is in use. At 100%, new requests will queue or fail.
            Current active: {{ with query "db_connections_active{pool_name='primary'}" }}{{ . | first | value | printf "%.0f" }}{{ end }}

      # 5. Service Down (no scrape data for 2 minutes)
      - alert: ServiceDown
        expr: |
          up{job="document-api"} == 0
        for: 2m
        labels:
          severity: critical
          team: backend
          page: "true"
        annotations:
          summary: "document-api is unreachable"
          description: >
            Prometheus cannot scrape {{ $labels.instance }}.
            The service may be crashed or the /metrics endpoint is broken.
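
Each rule above carries a `for:` duration, which keeps a brief spike from paging anyone. The sketch below models that behaviour as a small state machine: an alert is pending while its expression is true and only becomes firing once it has been true continuously for the hold duration. It is a simplified illustration with a hypothetical function name, not Prometheus's rule evaluator.

```python
def alert_states(condition_true: list[bool], interval_s: int, for_s: int) -> list[str]:
    """Return the alert state after each rule evaluation.

    condition_true: whether the expression returned a result at each evaluation
    interval_s:     seconds between evaluations
    for_s:          the rule's `for` duration in seconds
    """
    states: list[str] = []
    active_since: int | None = None
    for index, is_true in enumerate(condition_true):
        now = index * interval_s
        if not is_true:
            active_since = None  # any gap resets the hold timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        states.append("firing" if now - active_since >= for_s else "pending")
    return states


# Evaluated every 30s with `for: 1m`; the condition clears once at t=90s.
print(alert_states([True, True, True, False, True], interval_s=30, for_s=60))
# ['pending', 'pending', 'firing', 'inactive', 'pending']
```

The last entry shows the trade-off: a single clean evaluation resets the timer, so a flapping condition may never fire if `for` is set too long.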

Alertmanager Configuration

# config/alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"

route:
  group_by: ["alertname", "job", "severity"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: "slack-critical"

  routes:
    - match:
        severity: critical
        page: "true"
      receiver: "pagerduty"
    - match:
        severity: warning
      receiver: "slack-warnings"

receivers:
  - name: "slack-critical"
    slack_configs:
      - channel: "#incidents"
        send_resolved: true
        title: "{{ .CommonAnnotations.summary }}"
        text: "{{ .CommonAnnotations.description }}"

  - name: "slack-warnings"
    slack_configs:
      - channel: "#alerts"
        send_resolved: true

  - name: "pagerduty"
    pagerduty_configs:
      - routing_key: "YOUR_PAGERDUTY_INTEGRATION_KEY"
        description: "{{ .CommonAnnotations.summary }}"
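
The routing tree above is evaluated top to bottom: the first child route whose matchers all apply wins, and anything unmatched falls back to the top-level receiver. The following sketch mirrors that config in plain Python to show where each of the five alerts would land. It is a simplification that assumes exact-match label comparison and no `continue: true`; the names ROUTES and pick_receiver are hypothetical.

```python
# Child routes in config order: (required labels, receiver).
ROUTES: list[tuple[dict[str, str], str]] = [
    ({"severity": "critical", "page": "true"}, "pagerduty"),
    ({"severity": "warning"}, "slack-warnings"),
]
DEFAULT_RECEIVER = "slack-critical"


def pick_receiver(alert_labels: dict[str, str]) -> str:
    """Return the receiver of the first route whose labels all match."""
    for required, receiver in ROUTES:
        if all(alert_labels.get(key) == value for key, value in required.items()):
            return receiver
    return DEFAULT_RECEIVER


print(pick_receiver({"alertname": "ServiceDown", "severity": "critical", "page": "true"}))
print(pick_receiver({"alertname": "HighP99Latency", "severity": "warning"}))
print(pick_receiver({"alertname": "HighErrorRate", "severity": "critical"}))
```

With this tree, only ServiceDown pages through PagerDuty. HighErrorRate is critical but has no page label, so it goes to the default Slack receiver; add page: "true" to a rule's labels if it should wake someone up.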

10. Grafana Dashboard

A complete dashboard JSON for a document processing service. Import this via Grafana UI → Dashboards → Import → Paste JSON.

{
  "title": "Document API - Service Dashboard",
  "uid": "doc-api-v1",
  "schemaVersion": 38,
  "time": {"from": "now-1h", "to": "now"},
  "refresh": "30s",
  "panels": [
    {
      "id": 1,
      "title": "Request Rate (req/s)",
      "type": "timeseries",
      "gridPos": {"x": 0, "y": 0, "w": 6, "h": 8},
      "targets": [{
        "expr": "sum(rate(http_requests_total[5m]))",
        "legendFormat": "Total req/s"
      }]
    },
    {
      "id": 2,
      "title": "Error Rate (%)",
      "type": "timeseries",
      "gridPos": {"x": 6, "y": 0, "w": 6, "h": 8},
      "targets": [{
        "expr": "(sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))) * 100",
        "legendFormat": "5xx Error Rate %"
      }],
      "fieldConfig": {
        "defaults": {"thresholds": {"steps": [
          {"color": "green", "value": 0},
          {"color": "yellow", "value": 0.1},
          {"color": "red", "value": 1}
        ]}}
      }
    },
    {
      "id": 3,
      "title": "Latency Percentiles",
      "type": "timeseries",
      "gridPos": {"x": 12, "y": 0, "w": 12, "h": 8},
      "targets": [
        {
          "expr": "histogram_quantile(0.50, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))",
          "legendFormat": "p50"
        },
        {
          "expr": "histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))",
          "legendFormat": "p95"
        },
        {
          "expr": "histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))",
          "legendFormat": "p99"
        }
      ]
    },
    {
      "id": 4,
      "title": "In-Progress Requests",
      "type": "stat",
      "gridPos": {"x": 0, "y": 8, "w": 4, "h": 4},
      "targets": [{
        "expr": "sum(http_requests_inprogress)",
        "legendFormat": "In Progress"
      }]
    },
    {
      "id": 5,
      "title": "DB Pool Utilisation %",
      "type": "gauge",
      "gridPos": {"x": 4, "y": 8, "w": 4, "h": 4},
      "targets": [{
        "expr": "(db_connections_active{pool_name=\"primary\"} / db_connection_pool_size{pool_name=\"primary\"}) * 100",
        "legendFormat": "Pool Util %"
      }],
      "fieldConfig": {
        "defaults": {
          "min": 0, "max": 100, "unit": "percent",
          "thresholds": {"steps": [
            {"color": "green", "value": 0},
            {"color": "yellow", "value": 60},
            {"color": "red", "value": 80}
          ]}
        }
      }
    },
    {
      "id": 6,
      "title": "Cache Hit Rate %",
      "type": "gauge",
      "gridPos": {"x": 8, "y": 8, "w": 4, "h": 4},
      "targets": [{
        "expr": "(sum(rate(cache_operations_total{result=\"hit\"}[5m])) / sum(rate(cache_operations_total{result=~\"hit|miss\"}[5m]))) * 100",
        "legendFormat": "Hit Rate %"
      }]
    },
    {
      "id": 7,
      "title": "Memory Usage (RSS)",
      "type": "timeseries",
      "gridPos": {"x": 0, "y": 12, "w": 8, "h": 8},
      "targets": [{
        "expr": "process_resident_memory_bytes",
        "legendFormat": "RSS Memory"
      }],
      "fieldConfig": {"defaults": {"unit": "bytes"}}
    },
    {
      "id": 8,
      "title": "Python GC Collections/s",
      "type": "timeseries",
      "gridPos": {"x": 8, "y": 12, "w": 8, "h": 8},
      "targets": [{
        "expr": "sum by (generation) (rate(python_gc_collections_total[5m]))",
        "legendFormat": "Gen {{generation}}"
      }]
    },
    {
      "id": 9,
      "title": "Document Processing Rate",
      "type": "timeseries",
      "gridPos": {"x": 0, "y": 20, "w": 12, "h": 8},
      "targets": [
        {
          "expr": "sum by (content_type) (rate(documents_processed_total{status=\"success\"}[5m]))",
          "legendFormat": "{{content_type}} success/s"
        },
        {
          "expr": "sum by (content_type) (rate(documents_processed_total{status!=\"success\"}[5m]))",
          "legendFormat": "{{content_type}} error/s"
        }
      ]
    },
    {
      "id": 10,
      "title": "Model Inference p99 (s)",
      "type": "timeseries",
      "gridPos": {"x": 12, "y": 20, "w": 12, "h": 8},
      "targets": [{
        "expr": "histogram_quantile(0.99, sum by (le, model_name) (rate(model_inference_duration_seconds_bucket[5m])))",
        "legendFormat": "{{model_name}} p99"
      }]
    }
  ]
}

Prometheus + docker-compose: Complete Setup

# config/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 30s
  external_labels:
    environment: "production"
    cluster: "primary"

scrape_configs:
  - job_name: "document-api"
    metrics_path: "/metrics"
    scrape_timeout: 10s
    static_configs:
      - targets: ["app:8001"]
        labels:
          service: "document-api"

To verify your metrics are being scraped:

# Local development
curl http://localhost:8001/metrics | grep documents_processed

# PromQL via HTTP API (-G with --data-urlencode stops curl from treating {} as a glob)
curl -G 'http://localhost:9090/api/v1/query' --data-urlencode 'query=up{job="document-api"}'

Interview Questions and Answers

Q1: A Prometheus histogram has buckets at [0.1, 0.5, 1.0, 5.0, +Inf]. Your SLO is p99 < 800ms. Can you compute an accurate p99 from this histogram?

No. Prometheus computes histogram_quantile by linear interpolation within the bucket that contains the quantile. If the true p99 is 800ms, it falls in the bucket (0.5, 1.0]. Prometheus will linearly interpolate between 0.5s and 1.0s, giving an answer somewhere in that range, but it cannot tell you the exact value is 800ms. For an SLO target of 800ms, you need a bucket boundary at exactly 0.8 (or close to it) to get accurate alerting. Add 0.8 to your bucket list.
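
The effect of the missing boundary can be reproduced with a few lines of Python. This is a minimal sketch of the linear interpolation step, not Prometheus's actual implementation; the function name `estimate_quantile` and the bucket counts are made up for illustration. Both bucket layouts describe the same 1000 requests, of which 993 finished within 0.8s.

```python
import math

# Cumulative (upper_bound_seconds, count) pairs, like the *_bucket series.
# Illustrative counts for 1000 requests; 993 of them took <= 0.8s.
COARSE = [(0.1, 600), (0.5, 900), (1.0, 995), (5.0, 1000), (math.inf, 1000)]
WITH_SLO_BOUND = [(0.1, 600), (0.5, 900), (0.8, 993), (1.0, 995),
                  (5.0, 1000), (math.inf, 1000)]


def estimate_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile q by interpolating inside the bucket that holds it."""
    rank = q * buckets[-1][1]  # how many observations lie below the quantile
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank and count > lower_count:
            if math.isinf(upper_bound):
                # Nothing to interpolate towards: fall back to the
                # highest finite boundary.
                return lower_bound
            # Assume observations are spread evenly within the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


print(f"coarse buckets: p99 ~ {estimate_quantile(0.99, COARSE):.3f}s")
print(f"with 0.8 bound: p99 ~ {estimate_quantile(0.99, WITH_SLO_BOUND):.3f}s")
```

With the coarse buckets the estimate is about 0.974s, which looks like an SLO breach. With a boundary at 0.8 the estimate is about 0.790s, which correctly stays under the target.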

Q2: Your team wants to track which user ID is making the slowest requests by adding user_id as a Prometheus label. Why is this problematic, and what is the right approach?

Adding user_id as a label creates one time series per unique user ID. With 1 million users, you get 1 million time series just for that metric. Prometheus stores all active time series in RAM - this would consume gigabytes of memory and make Prometheus unusable. The right approach: keep user_id in logs (where cardinality is unlimited), and use a Prometheus histogram to track the distribution of slow requests. To find slow users, query logs with duration_ms > 2000 in Loki/Kibana. If you need per-user metrics for billing or SLAs, use a dedicated time series database like InfluxDB or a columnar store, not Prometheus.
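
The series count is simple multiplication, which a short sketch makes concrete. The label cardinalities below are assumed values for illustration, and the result is an upper bound that assumes every label combination actually occurs. The helper name `histogram_series_count` is hypothetical.

```python
from math import prod


def histogram_series_count(label_cardinalities: dict[str, int],
                           finite_buckets: int) -> int:
    """Upper bound on the time series exposed by one histogram metric.

    Each label combination gets one series per finite bucket, one for the
    +Inf bucket, plus the _sum and _count series. (The Python client may
    also expose a _created series, which is not counted here.)
    """
    series_per_combination = (finite_buckets + 1) + 2
    return prod(label_cardinalities.values()) * series_per_combination


# Assumed cardinalities: 4 HTTP methods, 20 routes, 4 finite buckets.
bounded = histogram_series_count({"method": 4, "route": 20}, finite_buckets=4)

# Same metric with a user_id label for 1 million users.
unbounded = histogram_series_count(
    {"method": 4, "route": 20, "user_id": 1_000_000}, finite_buckets=4
)

print(f"without user_id: {bounded:,} series")
print(f"with user_id:    {unbounded:,} series (upper bound)")
```

Real traffic will not hit every combination, but even a small fraction of the upper bound is far beyond what a single bounded label set produces.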

Q3: What is the difference between rate() and increase() in PromQL, and when should you use each?

rate(counter[5m]) computes the per-second average rate of increase over the time window. increase(counter[5m]) is rate(counter[5m]) * 300 (the window in seconds): it gives the total increase over the window. Both handle counter resets (process restarts), and both extrapolate to the edges of the window, so increase() can return non-integer values even for an integer counter. Use rate() for graphs and most alerting rules, because a per-second value keeps the same meaning when the window size changes. Use increase() when you want to see "how many events happened in the last hour" in a human-readable form. If you do alert on increase(), remember that the threshold is tied to the window length: changing [5m] to [10m] roughly doubles the value.
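
The relationship between the two functions, and the reset handling, can be sketched in plain Python. This is a deliberately simplified model: it divides by the time between the first and last sample and skips the extrapolation to the window edges that Prometheus performs. The function names and sample values are illustrative only.

```python
# (timestamp_seconds, counter_value) samples scraped every 15s.
# The drop from 160 to 10 models a process restart (counter reset).
SAMPLES = [(0, 100.0), (15, 130.0), (30, 160.0), (45, 10.0), (60, 40.0)]


def simple_increase(samples: list[tuple[float, float]]) -> float:
    """Total counter growth across the samples, treating drops as resets."""
    total = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:
            # After a reset the counter restarted from 0, so everything
            # it has counted since then is new growth.
            total += current
    return total


def simple_rate(samples: list[tuple[float, float]]) -> float:
    """Per-second average growth: the increase divided by the elapsed time."""
    elapsed = samples[-1][0] - samples[0][0]
    return simple_increase(samples) / elapsed


naive = SAMPLES[-1][1] - SAMPLES[0][1]
print(f"naive last - first: {naive}")
print(f"reset-aware increase: {simple_increase(SAMPLES)}")
print(f"reset-aware rate: {simple_rate(SAMPLES):.3f}/s")
```

Subtracting the first value from the last gives -60, while the reset-aware increase is 100 events, or about 1.667 per second over the 60-second span.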

Q4: How does histogram_quantile work across multiple service replicas in Kubernetes?

Because histograms aggregate at the bucket level, you can sum buckets across replicas before calculating the quantile. The correct query is:

histogram_quantile(0.99,
  sum by (le) (
    rate(http_request_duration_seconds_bucket[5m])
  )
)

The sum by (le) adds together the bucket counts from all replicas for each le value, then histogram_quantile computes the quantile from the combined distribution. This is why histograms are preferred over summaries for multi-instance deployments - Summary quantiles are computed per-instance and cannot be meaningfully summed.
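
To illustrate why the buckets must be summed before the quantile is taken, here is a small Python sketch with two replicas: a busy, fast one and a quiet, slow one. The bucket counts are invented, `merge_buckets` stands in for `sum by (le)`, and `estimate_quantile` is a simplified interpolation rather than Prometheus's own code.

```python
import math
from collections import Counter

# Cumulative counts keyed by upper bound (le), one dict per replica.
REPLICA_A = {0.1: 900, 0.5: 995, 1.0: 1000, 5.0: 1000, math.inf: 1000}
REPLICA_B = {0.1: 0, 0.5: 10, 1.0: 50, 5.0: 100, math.inf: 100}


def merge_buckets(replicas: list[dict[float, float]]) -> dict[float, float]:
    """Equivalent of `sum by (le)`: add the counts for each upper bound."""
    merged: Counter = Counter()
    for buckets in replicas:
        merged.update(buckets)  # Counter.update adds values per key
    return dict(sorted(merged.items()))


def estimate_quantile(q: float, buckets: dict[float, float]) -> float:
    """Linear interpolation inside the bucket that contains quantile q."""
    rank = q * buckets[math.inf]
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in sorted(buckets.items()):
        if count >= rank and count > lower_count:
            if math.isinf(upper_bound):
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


merged_p99 = estimate_quantile(0.99, merge_buckets([REPLICA_A, REPLICA_B]))
averaged_p99 = (estimate_quantile(0.99, REPLICA_A)
                + estimate_quantile(0.99, REPLICA_B)) / 2

print(f"p99 from merged buckets:   {merged_p99:.2f}s")
print(f"average of per-replica p99: {averaged_p99:.2f}s")
```

The merged estimate is about 4.12s, while averaging the per-replica values gives about 2.70s, because the average ignores how much traffic each replica served.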

Q5: Your Python service starts up and immediately registers dozens of Prometheus metrics. Another service team complains that their metrics are appearing in your /metrics endpoint. Why, and how do you fix it?

The Python client library (prometheus_client) uses a global default registry, prometheus_client.REGISTRY. Every metric created with Counter(...), Gauge(...), etc. is added to it unless you pass a different registry. If your application imports a shared library that also registers metrics, or if multiple application modules are loaded in the same process, they all share that registry and all appear on the same /metrics endpoint. Two fixes: (1) Create a custom registry: registry = CollectorRegistry() and pass it to every metric: Counter("name", "help", registry=registry). Then expose it: generate_latest(registry). This is the cleanest solution for libraries. (2) Unregister unwanted collectors from the default registry, for example REGISTRY.unregister(PROCESS_COLLECTOR) to drop the built-in process metrics. For application code, the global registry is usually fine; the problem typically indicates a dependency boundary issue that should be solved with custom registries.
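
A minimal sketch of the custom-registry fix, assuming `prometheus-client` is installed as listed in the prerequisites. The metric name `mylib_jobs_total` and the idea of a "shared library" owning it are hypothetical.

```python
from prometheus_client import CollectorRegistry, Counter, generate_latest

# A registry private to the (hypothetical) shared library, so its
# metrics stay out of the global default REGISTRY.
library_registry = CollectorRegistry()

jobs_total = Counter(
    "mylib_jobs_total",
    "Jobs handled by the hypothetical shared library.",
    registry=library_registry,
)
jobs_total.inc()

# Render each registry in the Prometheus text exposition format.
library_exposition = generate_latest(library_registry).decode("utf-8")
default_exposition = generate_latest().decode("utf-8")  # global REGISTRY

print("in library registry:", "mylib_jobs_total" in library_exposition)
print("in default registry:", "mylib_jobs_total" in default_exposition)
```

The counter shows up only in the library's own exposition, so an application that serves the default registry on /metrics never sees it unless it chooses to expose `library_registry` as well.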
